PART 1: Preliminary analysis

Introduction

Between 1991 and 2017, housing quality in New York improved dramatically; however, some sectors of the housing stock continue to face poor conditions, and some specific maintenance deficiencies remain more prevalent than others. In this project, we develop an index of poor housing quality in New York that measures physical deficiencies and shows how the prevalence of these issues has shifted over time.

Methodology

The index is a weighted sum of 22 variables chosen by the authors. A variable was selected if the authors agreed that it described poor housing conditions. The index is not exhaustive, and more data could potentially be collected to better suit our purpose.

Item  Description                                           NYCHVS Variable  Score
1     Exterior walls: Missing brick, siding or other        d1               2
2     Exterior walls: Sloping or bulging walls              d2               2
3     Exterior walls: Major cracks                          d3               2
4     Exterior walls: Loose or hanging cornice, roof, etc.  d4               2
5     Interior walls: Cracks or holes                       36a              2
6     Interior walls: Broken plaster or peeling paint       37a              2
7     Broken or missing windows                             e1               5
8     Rotten or loose windows                               e2               2
9     Boarded up windows                                    e3               3
10    Sagging or sloping floors                             g1               2
11    Slanted/shifted doorsills or frames                   g2               2
12    Deep wear in floor causing depressions                g3               2
13    Holes or missing flooring                             g4               2
14    Stairs: Loose, broken, or missing stair               f1               2
15    Stairs: Loose, broken, or missing steps               f2               2
16    No interior steps or stairways                        f4               2
17    No exterior steps or stairways                        f5               2
18    Number of heating equipment breakdowns                32b              2 per breakdown
19    Kitchen facilities functioning                        26c              3 if no, 5 if no kitchen facilities
20    Toilet breakdowns                                     25c              3 if any, 5 if no toilet or plumbing
21    Presence of mice or rats                              35a              3
22    Water leakage                                         38a              3
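
As an illustration of how the scores in the table combine into a single index value for one unit, the following is a minimal Python sketch. The point values come from the table; the function name `poor_quality_score`, the input encoding (a list of variable names plus the labels "ok", "not_functioning", "breakdown", and "none"), and the example unit are hypothetical and do not reflect the actual NYCHVS coding or the authors' code.

```python
# Points for yes/no deficiencies, keyed by the NYCHVS variable names in the table.
BINARY_POINTS = {
    "d1": 2, "d2": 2, "d3": 2, "d4": 2,    # exterior walls
    "36a": 2, "37a": 2,                    # interior walls
    "e1": 5, "e2": 2, "e3": 3,             # windows
    "g1": 2, "g2": 2, "g3": 2, "g4": 2,    # floors
    "f1": 2, "f2": 2, "f4": 2, "f5": 2,    # stairs
    "35a": 3, "38a": 3,                    # mice or rats, water leakage
}

# Hypothetical labels for the two graded items (26c and 25c).
KITCHEN_POINTS = {"ok": 0, "not_functioning": 3, "none": 5}
TOILET_POINTS = {"ok": 0, "breakdown": 3, "none": 5}


def poor_quality_score(deficiencies, heating_breakdowns=0, kitchen="ok", toilet="ok"):
    """Sum the table's points for a single housing unit."""
    # A set guards against counting the same deficiency twice.
    score = sum(BINARY_POINTS[name] for name in set(deficiencies))
    score += 2 * heating_breakdowns        # item 18: 2 points per breakdown
    score += KITCHEN_POINTS[kitchen]       # item 19
    score += TOILET_POINTS[toilet]         # item 20
    return score


# Hypothetical unit: broken windows, rodents, two heating breakdowns, no kitchen.
example_score = poor_quality_score(["e1", "35a"], heating_breakdowns=2, kitchen="none")
print(example_score)
```

A unit with no recorded deficiencies scores 0 under this scheme, which is the most common score in the data.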

Visualization

Figure 1 shows the poor quality index scores for the 156,230 occupied units in the New York Housing Dataset from 1991 to 2017. The frequency distribution is skewed to the right. Overall, forty-five percent of the units scored 0. The highest score, 54 points, occurred in 1993. The year 2008 had the highest percentage (64%) of units with a poor quality score of 0.

Figure 2 shows the percent of occupied units with poor quality scores. Over the period 1991 to 2017, most of the units had poor quality scores between 1 and 10 points; very few units had poor quality scores over 20 points.

Figure 3 tracks trends in poor quality index scores from 1991 to 2017. We report the mean, the median, and the 75th, 95th, and 99th percentiles. In most years, the median poor quality score was 0. The mean ranged from 4.0 in 1991 to 2.5 in 2017. The 99th percentile clearly shows the improvement of housing in New York (from 25 poor quality points in 1991 to 18 poor quality points in 2017).
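
The yearly summary statistics described for Figure 3 can be sketched as follows. This is a minimal Python illustration on made-up toy scores, not the survey data. The helper names `percentile` and `summarize` are hypothetical, and the nearest-rank percentile rule is an assumption: the authors' software may interpolate between values instead, which would give slightly different percentiles.

```python
import math
from statistics import mean, median


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(scores_by_year):
    """Return mean, median, and upper percentiles of index scores for each year."""
    summary = {}
    for year, scores in sorted(scores_by_year.items()):
        summary[year] = {
            "mean": mean(scores),
            "median": median(scores),
            "p75": percentile(scores, 75),
            "p95": percentile(scores, 95),
            "p99": percentile(scores, 99),
        }
    return summary


# Toy scores for two hypothetical survey years (not real NYCHVS values).
toy_scores = {
    1991: [0, 0, 0, 2, 3, 5, 8, 12, 20, 30],
    2017: [0, 0, 0, 0, 0, 2, 3, 4, 9, 12],
}
toy_summary = summarize(toy_scores)
for year, stats in toy_summary.items():
    print(year, stats)
```

Because the score distribution is strongly right-skewed, the upper percentiles respond to changes among the worst units that the median, which sits at 0 in most years, cannot show.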

Figure 4 shows poor housing conditions in the five boroughs of New York City from 1991 to 2017. Overall, housing quality improved in all five boroughs. The Bronx had the worst housing conditions and Staten Island had the best.

Figure 5: Average Household Income and Index by Sub-borough


Limitations and Future Plans

Ultimately, we did not arrive at a method to test our index. However, the authors believe the index should be validated against a variable that is indicative of quality but not measured in the index. We plan to find data to perform such a validation test. Some potential variables were omitted because they only had data for recent years. For example, whether a unit has functioning air conditioning was only measured during the years 2014 and 2017. This variable would have been useful in measuring housing quality, but the authors chose not to include such data because it may inflate the index scores for later years. However, there are plans to create stronger indexes for the recent years.

In this paper we have created a housing quality index that measures poor housing conditions. We observe that housing conditions have been slowly improving over time, particularly among units with high index values. Our goal was to measure housing quality, and our proposed index specifically measures poor housing conditions rather than quality in general. We believe it would be beneficial to create several indexes concerning housing quality (e.g., a High Quality Index and a Neighborhood Quality Index) and to consider all such indexes when assessing housing quality. We also recommend further exploration of the spatial component of the data to see whether factors such as crime and location are related to the index value.

PART 2: Statistical analyses

Multiple regression analyses

## [1] 152313     35
## 
## Call:
## lm(formula = pqi ~ hhinc + X_30a + X_31b, data = data_part2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.400  -3.736  -1.544   1.989  49.613 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.766e+00  2.952e-02 161.446   <2e-16 ***
## hhinc        9.114e-08  2.491e-07   0.366    0.715    
## X_30a        6.945e-04  7.086e-05   9.800   <2e-16 ***
## X_31b       -1.762e-03  6.938e-05 -25.390   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.083 on 77284 degrees of freedom
##   (75025 observations deleted due to missingness)
## Multiple R-squared:  0.03649,    Adjusted R-squared:  0.03646 
## F-statistic: 975.7 on 3 and 77284 DF,  p-value: < 2.2e-16

Logistic regression analyses

## 
## Call:
## glm(formula = X_25c ~ hhinc + X_30a + X_31b, data = data_part2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2536  -0.1270  -0.1208  -0.1094   1.0239  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.352e-01  1.980e-03  68.278  < 2e-16 ***
## hhinc       -3.651e-08  1.697e-08  -2.152   0.0314 *  
## X_30a        1.873e-05  4.731e-06   3.959 7.53e-05 ***
## X_31b       -3.328e-05  4.637e-06  -7.176 7.26e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1058602)
## 
##     Null deviance: 7439.9  on 70105  degrees of freedom
## Residual deviance: 7421.0  on 70102  degrees of freedom
##   (82207 observations deleted due to missingness)
## AIC: 41526
## 
## Number of Fisher Scoring iterations: 2
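
Note that the output above reports a gaussian family and a dispersion parameter, which is what R's glm produces when no family argument is given. A logistic fit would normally pass family = binomial and needs a 0/1 outcome; whether X_25c is coded that way is not stated here. As a hedged illustration of what a logistic fit estimates, the following minimal Python sketch fits an intercept and one slope by Newton's method. The function names and the toy data are hypothetical and unrelated to data_part2.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton's method (maximum likelihood)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0                # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0        # entries of the negative Hessian
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:             # stop if the 2x2 system is numerically singular
            break
        # Newton step: solve the 2x2 system (negative Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical toy data: a 0/1 outcome that becomes more likely as x grows.
toy_x = [0, 1, 2, 3, 4, 5, 6, 7]
toy_y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(toy_x, toy_y)
print(b0, b1)
```

Unlike the gaussian fit, the fitted values of a logistic model always lie between 0 and 1, and the coefficients are on the log-odds scale rather than the probability scale.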

Machine learning procedure

1=“Bronx”, 2=“Brooklyn”, 3=“Manhattan”, 4=“Queens”, 5=“Staten Island”

##       Predicted
## Actual     1     2     3     4     5
##      1  5952  2605  1483   993   104
##      2  1762 12410  2434  1663   152
##      3  1266  2625 11872  1708   145
##      4   917  1989  1646  7892   167
##      5   138   324   309   504   508

Correct classification rate

## [1] 0.2536487
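
As a check, the classification rate can be recomputed directly from the confusion matrix above. In the minimal Python sketch below, the matrix values and the row count 152313 are copied from the printed output; the variable names are our own. The diagonal sums to 38,634 correct predictions out of 61,568 classified units, a rate of about 0.63. The reported 0.2536487 appears to equal 38,634 divided by 152,313, the number of rows printed at the start of Part 2. This suggests, though we cannot confirm it from the output alone, that the denominator included units that were never classified.

```python
# Rows are actual boroughs, columns are predicted boroughs, in the order
# 1=Bronx, 2=Brooklyn, 3=Manhattan, 4=Queens, 5=Staten Island.
confusion = [
    [5952, 2605, 1483, 993, 104],
    [1762, 12410, 2434, 1663, 152],
    [1266, 2625, 11872, 1708, 145],
    [917, 1989, 1646, 7892, 167],
    [138, 324, 309, 504, 508],
]

# Correct predictions lie on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
classified = sum(sum(row) for row in confusion)
accuracy = correct / classified

# Share of each actual borough that was predicted correctly.
per_borough_rate = [row[i] / sum(row) for i, row in enumerate(confusion)]

# Row count of data_part2 as printed at the start of Part 2.
total_rows = 152313
rate_over_all_rows = correct / total_rows

print(correct, classified, round(accuracy, 4), round(rate_over_all_rows, 7))
```

The per-borough rates are worth reporting alongside the overall rate, since the matrix shows that the boroughs differ considerably in how many units they contribute.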

References

https://www.huduser.gov/publications/pdf/AHS_hsg.pdf

https://www1.nyc.gov/site/hpd/about/nychvs-asa-data-challenge-expo.page